Skip to content

Conversation

@sayakpaul
Copy link
Member

What does this PR do?

See: #11948

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu July 28, 2025 14:48
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


ignore_patterns = ["*.json", "*.md"]
# `model_info` call must guarded with the above condition.
model_files_info = model_info(pretrained_model_name_or_path, revision=revision, token=token)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the purpose of this check is to verify if the necessary sharded files are present in the model repo before attempting a download, presumably to avoid a large download if all files aren't present. If we cannot connect to the hub, we just have to assume the necessary shard files are already present locally.

I think we can just skip this check if local_files_only=True and then check if all the shard filenames are present in the cached_folder

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about now?

Copy link
Collaborator

@DN6 DN6 Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just this is sufficient

if not local_files_only:
    # run model_info check

Run snapshot download

Then after the cached_filenames is created, iterate over the files to verify they exist

for filename in cached_filename:
      if not if not os.path.exists(filename):
           raise EnvironmentError("expected file not present in {cached_folder}")

Copy link
Member Author

@sayakpaul sayakpaul Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We don't have to run snapshot_download() when local_files_only=False, that might be unnecessary.
  2. Why run snapshot_download() after also running model_info()?
  3. Even if we run snapshot_download() regardless of local_files_only var, I think we should have it inside try-except in case the endpoint cannot be pinged for some reason and raise the ConnectionError as before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if b7af511 resolves this.

Copy link
Member Author

@sayakpaul sayakpaul Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean. Let me update. Sorry about the back and forth.

@sayakpaul sayakpaul requested a review from DN6 August 12, 2025 14:57
@sayakpaul
Copy link
Member Author

@DN6 see if the latest changes work for you.

cached_folder = os.path.join(cached_folder, subfolder)

# Check again after downloading/loading from the cache.
model_files_info = _get_filepaths_for_folder(cached_folder)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a new function to walk over the cached folder. If you just iterate over cached_filenames and check if the file exits. Avoid looping over files multiple times this way.

for cached_file in cached_filenames:
      if not os.path.exists(cached_file):
                raise EnvironmentError(f"{cached_file} not found in {cached_folder which is required..." 

@DN6 DN6 merged commit 1b48db4 into main Aug 14, 2025
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants